Stemming and Lemmatization with Python and NLTK

November 23, 2017

Stemming and lemmatization are essential for many text mining tasks such as information retrieval, text summarization, topic extraction as well as translation.

Stemming

Stemming removes prefixes and suffixes from a word and reduces it to a base form. However, the resulting stem might not exist in the dictionary. Let's take a look at how NLTK stems words.
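
Before turning to NLTK, the idea can be illustrated with a toy suffix stripper. This is only a sketch: it is not the Porter or Lancaster algorithm, and the names toy_stem, SUFFIXES and MIN_STEM_LEN are made up for this example. It does show why a stem is not necessarily a dictionary word.

```python
# Toy suffix stripper: an illustration only, NOT the Porter or Lancaster algorithm.
# The suffix list and the minimum stem length are arbitrary assumptions.
SUFFIXES = ("ing", "ed", "es", "s")
MIN_STEM_LEN = 3


def toy_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


for w in ("deactivating", "deactivated", "deactivates", "cars"):
    print(w, "->", toy_stem(w))
```

All three forms of "deactivate" collapse to "deactivat", which groups them together but is not a real word. Real stemmers apply many more rules and conditions than this.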

Using PorterStemmer

The Porter stemmer is the most commonly used stemmer because of its good results.

    # Import the stemmer class
    from nltk.stem import PorterStemmer  # the most commonly used stemmer

    ps = PorterStemmer()
    print(ps.stem("lying"), ps.stem("lies"), ps.stem("lied"))
    # Output: lie lie lie

Using LancasterStemmer

Let's compare our results with LancasterStemmer, which is based on the Lancaster stemming algorithm. It has more than 120 rules for deriving stems.

    # Import the stemmer class
    from nltk.stem import LancasterStemmer

    ls = LancasterStemmer()
    print(ls.stem("lying"), ls.stem("lies"), ls.stem("lied"))
    # Output: lying lie lied

We can see that the two algorithms produce different outputs. There is also SnowballStemmer, which supports languages other than English.

Lemmatization

Lemmatization is quite similar to stemming, as it also converts a word into its base form. However, the root word, also called the lemma, is present in the dictionary. It is considerably slower than stemming because an additional step is performed to check whether the lemma is present in the dictionary.
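
The extra dictionary check can be sketched with a toy lemmatizer. This is a simplified illustration, not WordNet's actual code: VOCAB, RULES and toy_lemmatize are hypothetical names invented here, and the vocabulary is a tiny hand-made set. Each rule proposes a candidate, and a candidate is accepted only if it is a known word.

```python
# Toy dictionary-checked lemmatizer: a simplified illustration, not WordNet's code.
# VOCAB and RULES are made-up assumptions for this example.
VOCAB = {"run", "lie", "car", "deactivate"}
RULES = [
    ("ies", "y"),
    ("es", "e"),
    ("es", ""),
    ("s", ""),
    ("ing", "e"),
    ("ing", ""),
    ("ed", "e"),
    ("ed", ""),
]


def toy_lemmatize(word, vocab=VOCAB):
    """Return the first rule-generated candidate found in vocab, else the word unchanged."""
    if word in vocab:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in vocab:  # the extra dictionary check
                return candidate
    return word


for w in ("deactivating", "cars", "m!spleed"):
    print(w, "->", toy_lemmatize(w))
```

Unlike a plain suffix stripper, this never returns a made-up form: a word with no candidate in the vocabulary, such as "m!spleed", comes back unchanged.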

Note: We also have to specify the part of speech (POS) of the word in order to get the correct lemma. A word can be a noun (n), adjective (a), verb (v), or adverb (r). Therefore, we first have to get the POS of a word before we can lemmatize it.

First, let's import the libraries.

    from collections import Counter
    # Requires the WordNet data, e.g. via nltk.download("wordnet")
    from nltk.corpus import wordnet  # dictionary of words with their parts of speech
    from nltk.stem import WordNetLemmatizer  # lemmatizes a word based on its part of speech

Okay, now we have to get the POS of a word. For this purpose, we can use the WordNet corpus, which returns all the synsets of a word in a list, each tagged with a POS. I have written a function that picks the most frequent POS among them.

    def get_pos(word):
        w_synsets = wordnet.synsets(word)

        pos_counts = Counter()
        pos_counts["n"] = len([item for item in w_synsets if item.pos() == "n"])
        pos_counts["v"] = len([item for item in w_synsets if item.pos() == "v"])
        pos_counts["a"] = len([item for item in w_synsets if item.pos() == "a"])
        pos_counts["r"] = len([item for item in w_synsets if item.pos() == "r"])

        most_common_pos_list = pos_counts.most_common(3)
        # First index picks the top entry, second index picks the POS from the (POS, count) tuple
        return most_common_pos_list[0][0]
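
One detail of get_pos is what happens on ties, or when WordNet knows no synsets for the word at all. The sketch below mirrors the counting logic on a plain list of POS letters, so it runs without NLTK. The helper name pick_pos is hypothetical and not part of the original code. Counter.most_common orders equal counts by first insertion, so "n" is the fallback.

```python
from collections import Counter


def pick_pos(pos_tags):
    """Mirror get_pos's counting on a plain list of POS letters (hypothetical helper)."""
    pos_counts = Counter()
    for pos in ("n", "v", "a", "r"):  # insertion order decides ties
        pos_counts[pos] = pos_tags.count(pos)
    return pos_counts.most_common(1)[0][0]


print(pick_pos(["n", "v", "v", "v"]))  # v
print(pick_pos([]))                    # n
print(pick_pos(["a", "r"]))            # a
```

This is why a string with no synsets is treated as a noun: every count is zero and the first key inserted wins.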

Okay, now let's create the WordNetLemmatizer object and then perform the lemmatization. Its lemmatize method takes two arguments: the word to lemmatize and the POS of the word.

    words = ["running", "lying", "cars", "m!spleed"]
    wnl = WordNetLemmatizer()
    for word in words:
        print(wnl.lemmatize(word, get_pos(word)), end=" ")  # print without a newline
    # Output: run lie car m!spleed

Difference between Stemming and Lemmatization

The difference between stems and lemmas is that lemmas are present in the dictionary, while stems might not be. The following demonstration reuses ps and get_pos from above.

    print("Stemming results:", end=" ")
    print(ps.stem("deactivating"), ps.stem("deactivated"), ps.stem("deactivates"))

    print("Lemmatization results:", end=" ")
    words = ["deactivating", "deactivated", "deactivates"]
    wnl = WordNetLemmatizer()
    for word in words:
        print(wnl.lemmatize(word, get_pos(word)), end=" ")  # print without a newline

    # Output:
    # Stemming results: deactiv deactiv deactiv
    # Lemmatization results: deactivate deactivate deactivate

Alright, that concludes our demonstration for stemming and lemmatization using NLTK in Python.

#text-mining
